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Abstract 

Random projection is widely used as a metliod of dimension reduction. In recent years, its combination witli standard 
techniques of regression and classification fias been explored. Here we examine its use for anomaly detection in high-dimensional 
settings, in conjunction with principal component analysis (PCA) and corresponding subspace detection methods. We assume a 
so-called spiked covariance model for the underlying data generation process and a Gaussian random projection. We adopt a 
hypothesis testing perspective of the anomaly detection problem, with the test statistic defined to be the magnitude of the residuals 
of a PCA analysis. Under the null hypothesis of no anomaly, we characterize the relative accuracy with which the mean and 
variance of the test statistic from compressed data approximate those of the corresponding test statistic from uncompressed data. 
Furthermore, under a suitable alternative hypothesis, we provide expressions that allow for a comparison of statistical power for 
detection. Finally, whereas these results correspond to the ideal setting in which the data covariance is known, we show that it is 
possible to obtain the same order of accuracy when the covariance of the compressed measurements is estimated using a sample 
covariance, as long as the number of measurements is of the same order of magnitude as the reduced dimensionality. 

Keywords: Anomaly detection, Principal component analysis, Random projection. 



I. Introduction 

Principal component analysis (PCA) is a classical tool for dimension reduction that remains at the heart 
of many modern techniques in multivariate statistics and data mining. Among the multitude of uses that 
have been found for it, PCA often plays a central role in methods for systems monitoring and anomaly 
detection. A prototypical example of this is the method of Jackson and Mudholkar [fT3l . the so-called PCA 
subspace projection method. In their approach, PCA is used to extract the primary trends and patterns 
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in data and the magnitude of the residuals (i.e., the norm of the projection of the data into the residual 
subspace) is then monitored for departures, with principles from hypothesis testing being used to set 
detection thresholds. This method has seen widespread usage in industrial systems control (e.g.[|6l, [|27l . 
[|3T1l ). More recently, it is also being used in the analysis of financial data (e.g. [fTTI . [|2T]| . EllD and of 
Internet traffic data (e.g. M^, [l20ll). 

In this paper, we propose a methodology in which PCA subspace projection is applied to data that have 
first undergone random projection. Two key observations motivate this proposal. First, as is well-known, 
the computational complexity of PCA, when computed using the standard approach based on the singular 
value decomposition, scales like 0{l^ + Pn), where / is the dimensionality of the data and n is the sample 
size. Thus use of the PCA subspace method is increasingly less feasible with the ever-increasing size nd 
dimensions of modem data sets. Second, concerns regarding data confidentiality, whether for proprietary 
reasons or reasons of privacy, are more and more driving a need for statistical methods to accommodate. 
The first of these problems is something a number of authors have sought to address in recent years (e.g., 
[|35ll . [[T6ll . [|32ll . [[TtII ). while the second, of course, does not pertain to PCA-based methods alone. Our 
proposal to incorporate random projection into the PCA subspace method is made with both issues in 
mind, in that the original data are transformed to a random coordinate space of reduced dimension prior 
to being processed. 

The key application motivating our problem is that of monitoring Internet traffic data. Previous use of 
PCA subspace methods for traffic monitoring [fT9l . [|20l has been largely restricted to the level of traffic 
traces aggregated over broad metropolitan regions (e.g.. New York, Chicago, Los Angeles) for a network 
covering an entire country or continent (e.g., the United States, Europe, etc.). This level of aggregation is 
useful for monitoring coarse-scale usage patterns and high-level quality-of-service obligations. However, 
much of the current interest in the analysis of Internet traffic data revolves around the much finer scale of 
individual users. Data of this sort can be determined up to the (apparent) identity of individual computing 
devices, i.e., so-called IP addresses. But there are as many as 2^^ such IP address, making the monitoring 
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of traffic at this level a task guaranteed to involve massive amounts of data of very high dimension. 
Furthermore, it is typically necessary to anonymize data of this sort, and often it is not possible for 
anyone outside of the auspices of a particular Internet service provider to work with such data in its 
original form. The standard technique used when data of this sort are actually shared is to aggregate 
the IP addresses in a manner similar to the coarsening of geo-coding (e.g., giving only information on a 
town of residence, rather than a street address). Our proposed methodology can be viewed as a stylized 
prototype, establishing proof-of-concept for the use of PCA subspace projection methods on data like 
IP-level Internet traffic in a way that is both computationally feasible and respects concerns for data 
confidentiality. 

Going back to the famous Johnson-Lindenstrauss lemma [O, it is now well-known that an appropriately 
defined random projection will effectively preserve length of data vectors as well as distance between 
vectors. This fact lies at the heart of an explosion in recent years of new theory and methods in statistics, 
machine learning, and signal processing. These include [fTOl , dH. See, for example, the review [|33l . 
Many of these methods go by names emphasizing the compression inherent in the random projection, such 
as 'compressed sensing' or 'compressive sampling'. In this spirit, we call our own method compressed 
PCA subspace projection. The primary contribution of our work is to show that, under certain sparseness 
conditions on the covariance structure of the original data, the use of Gaussian random projection followed 
by projection into the PCA residual subspace yields a test statistic Q* whose distributional behavior is 
comparable to that of the statistic Q that would have been obtained from PCA subspace projection on the 
original data. And furthermore that, up to higher order terms, there is no loss in accuracy if an estimated 
covariance matrix is used, rather than the true (unknown) covariance, as long as the sample size for 
estimating the covariance is of the same order of magnitude as dimension of the random projection. 

While there is, of course, an enormous amount of literature on PCA and related methods, and in 
addition, there has emerged in more recent years a substantial literature on random projection and its 
integration with various methods for classical problems (e.g., regression, classification, etc.), to the best 



of our knowledge there are only two works that, like ours, explicitly address the use of the tools from 
these two areas in conjunction with each other. In the case of the first [|28l . a method of random projection 
followed by subspace projection (via the singular value decomposition (SVD)) is proposed for speeding 
up latent semantic indexing for document analysis. It is shown [28, Thm 5] that, with high probability, 
the result of applying this method to a matrix will yield an approximation of that matrix that is close 
to what would have been obtained through subspace projection applied to the matrix directly. A similar 
result is established in JHl Thm 5], where the goal is to separate a signal of interest from an interfering 
background signal, under the assumption that the subspace within which either the signal of interest or 
the interfering signal resides is known. In both [|28l and (HI, the proposed methods use a general class of 
random projections and fixed subspaces. In contrast, here we restrict our attention specifically to Gaussian 
random projections but adopt a model-based perspective on the underlying data themselves, specifying 
that the data derive from a high-dimensional zero-mean multivariate Gaussian distribution with covariance 
possessed of a compressible set of eigenvalues. In addition, we study the cases of both known and unknown 
covariance. Our results are formulated within the context of a hypothesis testing problem and, accordingly, 
we concentrate on understanding the accuracy with which (i) the first two moments of our test statistic 
is preserved under the null hypothesis, and (ii) the power is preserved under an appropriate alternative 
hypothesis. From this perspective, the probabilistic statements in [|28l , ^ can be interpreted as simpler 
precursors of our results, which nevertheless strongly suggest the feasibility of what we present. Finally, 
we note too that the authors in [[8l also propose a method of detection in a hypothesis testing setting, and 
provide results quantifying the accuracy of power under random projection, but this is offered separate 
from their results on subspace projections, and in the context of a model specifying a signal plus white 
Gaussian noise. 

This paper is organized as follows. In Section |n] we review the standard PCA subspace projection 
method and establish appropriate notation for our method of compressed PCA subspace projection. Our 



main results are stated in Section III where we characterize the mean and variance behavior of our statistic 



Q* as well as the size and power of the corresponding statistical test for anomalies based on this statistic. 



In Section IV we present the results of a small simulation study. Finally, some brief discussion may be 



found in Section |Vj The proofs for all theoretical results presented herein may be found in the appendices. 

II. Background 

Let X G be a multivariate normal random vector of dimension /, with zero mean and positive definite 
covariance matrix S. Let S = VAV'^ be the eigen-decomposition of S. Denote the prediction of X by the 
first k principal components of S as X = {ykV^)X. Jackson and Mudholkar IITBII . following an earlier 
suggestion of Jackson and Morris lfT2l in the context of 'photographic processing', propose to use the 
square of the £2 norm of the residual from this prediction as a statistic for testing goodness-of-fit and, 
more generally, for multivariate quality control. This is what is referred to now in the literature as the 
PCA subspace method. 

Denoting this statistic as 

Q = {X-XY{X-X) , (1) 

we know that Q is distributed as a hnear combination of independent and identically distributed chi-square 
random variables. In particular, 

i=k+l 

where cTj are the eigenvalues of S and the Zi are independent and identically distributed standard normal 
random variables. A normal approximation to this distribution is proposed in fT3l, based on a power- 
transformation and appropriate centering and scaling. Here, however, we will content ourselves with the 

simpler approximation of Q by a normal with mean and variance 

I I 



cTi and 2 ^ or- , 



i=k+l i=k+l 

respectively. This approximation is well-justified theoretically (and additionally has been confirmed in 



preliminary numerical studies analogous to those reported later in this paper) by the fact that I — k 
typically will be quite large in our context. In addition, the resulting simplification will be convenient 
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in facilitating our analysis and in rendering more transparent the impact of random projection on our 
proposed extension of Jackson and Mudholkar's approach. 

As stated previously, our extension is motivated by a desire to simultaneously achieve dimension 
reduction and ensure data confidentiality. Accordingly, let $ = (</)ij);xp, for / ^ p, where the are 
independent and identically distributed standardized random variables, i.e., such that E{(j)) = and 
Var{(f)) = 1. Throughout this paper we will assume that the (pij have a standard normal distribution. 
The random matrix $ will be used to induce a random projection 

Note that tends to the identity matrix Ii^i when Z,p — )■ oo in an appropriate manner [2]. As a 

result, we see that an intuitive advantage of this projection is that the inner product and the corresponding 
Euclidean distance are essentially preserved, while reducing the dimensionality of the space from / to p. 

Under our intended scenario, rather than observe the original random variable X we instead suppose that 
we see only its projection, which we denote as Y = p^^^'^^^X. Consider now the possibility of applying 
the PCA subspace method in this new data space. Conditional on the random matrix $, the random 
variable Y is distributed as multivariate normal with mean zero and covariance S* = (l/p)$^S<l>. Denote 
the eigen-decomposition of this covariance matrix by S* = UA*U'^, let Y = {UkUl)Y represent the 
prediction of Y by the first k principal components of S*, where Uk is the first k columns of U, and let 
F = F — F be the corresponding residual. Finally, define the squared £2 norm of this residual as 

Q* = Y^Y. 

The primary contribution of our work is to show that, despite not having observed X, and therefore 
being unable to calculate the statistic Q, it is possible, under certain conditions on the covariance S of 
X to apply the PCA subspace method to the projected data Y, yielding the statistic Q*, and nevertheless 
obtain anomaly detection performance comparable to that which would have been yielded by Q, with the 
discrepancy between the two made precise. 
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III. Main Results 

It is unrealistic to expect that the statistics Q and Q* would behave comparably under general conditions. 
At an intuitive level it is easy to see that what is necessary here is that the underlying eigen-structure of S 
must be sufficiently well-preserved under random projection. The relationship between eigen-values and 
-vectors with and without random projection is an area that is both classical and the focus of much recent 
activity. See [T|, for example, for a recent review. A popular model in this area is the spiked covariance 
model of Johnstone ifTSl . in which it is assumed that the spectrum of the covariance matrix S behaves as 

ai> a2. . . > (Tm> CTm+l = . . . = ff; = 1 . 

This model captures the notion - often encountered in practice - of a covariance whose spectrum exhibits 
a distinct decay after a relatively few large leading eigenvalues. 

All of the results in this section are produced under the assumption of a spiked covariance model. We 
present three sets of results: (i) characterization of the mean and variance of Q*, in terms of those of Q, 
in the absence of anomalies; (ii) a comparison of the power of detecting certain anomalies under Q* and 
Q\ and (iii) a quantification of the implications of estimation of S* on our results. 

A. Mean and Variance of Q* in the Absence of Anomalies 

We begin by studying the behavior of Q* when the data are in fact not anomalous, i.e., when X truly is 
normal with mean and covariance S. This scenario will correspond to the null hypothesis in the formal 
detection problem we set up shortly below. Note that under this scenario, similar to Q, the statistic Q* 
is distributed, conditional on $, as a linear combination of p — k independent and identically distributed 

chi-square random variables, with mean and variance given by 

I I 
J2 < and 2 {a*f , 

i=k+l i=k+l 

respectively, where (al, ... ,a*) is the spectrum of S*. Our approach to testing will be to first center 
and scale Q*, and to then compare the resulting statistic to a standard normal distribution for testing. 
Therefore, our primary focus in this subsection is on characterizing the expectation and variance of Q* 



The expectation of Q* may be characterized as follows. 

Theorem 1: Assume /,p — t- oo such that ^ = c + o(p~^/^). If A; > m and o"m > 1 + y/c, then 

Exi^{Qn = Ex{Q) + Op{l) . (2) 

Thus Q* differs from Q in expectation, conditional on $, only by a constant independent of / and p. 
Alternatively, if we divide through by p and note that under the spiked covariance model 

-ExiQ) = — ^c, (3) 
p p 

as Z,p — 7- oo , then from (|2]) we obtain 

-Exi^{Q*)=c + Op{p-') . (4) 
p 

In other words, at the level of expectations, the effect of random projection on our (rescaled) test statistic 
is to introduce a bias that vanishes like Op{p~^). 
The variance of Q* may be characterized as follows. 

Theorem 2: Assume /,p — )• oo such that ^ = c + o{p'^/'^). If k > m and (Xm > 1 + -\/c, then 

That is, the conditional variance of Q* differs from the variance of Q by a factor of (c+ 1), with a relative 
bias term of order Op(p~^/^). 

Taken together. Theorems [T] and [2] indicate that application of the PCA subspace method on non- 
anomalous data after random projection produces a test statistic Q* that is asymptotically unbiased for 
the statistic Q we would in principle like to use, if the original data X were available to us, but whose 
variance is inflated over that of Q by a factor depending explicitly on the amount of compression inherent 
in the projection. In Section |W] we present the results of a small numerical study that show, over a range 
of compression values c, that the approximations in (|4]) and (|5]) are quite accurate. 



B. Comparison of Power for Detecting Anomalies 

We now consider the comparative theoretical performance of the statistics Q and Q* for detecting 
anomalies. From the perspective of the PCA subspace method, an 'anomaly' is something that deviates 
from the null model that the multivariate normal vector X has mean zero and covariance S = VAV^ in 
such a way that it is visible in the residual subspace, i.e., under projection by / — VkVf[ . Hence,we treat 
the anomaly detection problem in this setting as a hypothesis testing problem, in which, without loss of 
generality, 

Ho:fi = Q and ifi : = (0^_^, 7, 0, • • • , 0) , (6) 

d>k 

for n = E{X) and 7 > 0. 

Recall that, as discussed in Section 2, it is reasonable in our setting to approximate the distribution of 
appropriately standardized versions of Q and Q* by the standard normal distribution. Under our spiked 
covariance model, and using the results of Theorems [T] and [2| this means comparing the statistics 

Q-jl-k) Q*-il-k) 
^/2{l - k) ^2{l-k){c+l) ' 

respectively, to the upper 1 — a critical value Zi^a of a standard normal distribution. Accordingly, we 

define the power functions 

- ' ( " ' 

and 

P0WERq.(7) := P I ^ ^ > 1 , (9) 

for Q and Q*, respectively, where the probabilities P on the right-hand side of these expressions refer to 
the corresponding approximate normal distribution. 

Our goal is to understand the relative magnitude of PoweRq* compared to Povv^eRq, as a function of 
7, /, k, c, and a. Approximations to the relevant formulas are provided in the following theorem. 

Theorem 3: Let Z he a standard normal random variable. Under the same assumptions as Theorems [T] 



10 



and [2} and a Gaussian approximation to the standardized test statistics, we have that 

P0WERQ(7) = P(Z>g;.rJi) and PoweRq,(7) = P (Z > Q.^f-i) , 

where 

V2(/ - /c) +472 



while 



rrU Z^.^^/W^)- [lVV^ + Op{l), .... 



v/2(/-fc)+472 + Op(pV2) 



Ignoring error terms, we see that the critical values ( 10 1 and ( 1 1 1 for both power formulas have as their 
argument quantities of the form ciZi^a — C2. However, while ci{Q*) ~ ci{Q), we have that C2{Q*) ~ 
C2{Q)/{c+ 1)^^^. Hence, all else being held equal, as the compression ratio c increases, the critical value 
at which power is evaluated shifts increasingly to the right for Q*, and power decreases accordingly. The 
extent to which this effect will be apparent is modulated by the magnitude 7 of the anomaly to be detected 
and the significance level a at which the test is defined, and furthermore by the size I of the original data 
space. Finally, while these observations can be expected to be most accurate for large I and large 7, in 



the case that either or both are more comparable in size to the Op{p^^'^) and Op{l) error terms in (11), 
respectively, the latter will play an increasing role and hence affect the accuracy of the stated results. 

An illustration may be found in Figure [Tj There we show the power PoweRq. as a function of the 
compression ratio c, for 7 = 10,20,30,40, and 50. Here the dimension before projection is / = 10,000 
and the dimension after projection p = l/c ranges from 10, 000 to 500. A value of A; = 30 was used for 
the dimension of the principle component analysis, and a choice of a = 0.05 was made for the size of the 
underlying test for anomaly. Note that at c = 0, on the far left-hand side of the plot, the value PoWERg* 
simply reduces to PoWERg. So the five curves show the loss of power resulting from compression, as a 
function of compression level c, for various choices of strength 7 of the anomaly. 



Additional numerical results of a related nature are presented in Section IV 



11 




C. Unknown Covariance 

The test statistics Q and Q* are defined in terms of the covariance matrices S and S*, respectively. 
However, in practice, it is unlikely that these matrices are known. Rather, it is more likely that estimates of 
their values be used in calculating the test statistics, resulting, say, in statistics Q and Q* . In the context of 
industrial systems control, for example, it is not unreasonable to expect that there be substantial previous 
data that may be used for this purpose. As our concern in this paper is on the use of the subspace 
projection method after random projection, i.e., in the use of Q*, the relevant question to ask here is what 
are the implications of using an estimate S* for S*. 

We study the natural case where the estimate S* is simply the sample covariance ^(Y — F)(Y — F)^, 
for Y = [Yi, . . . , F„] the p x n matrix formed from n independent and identically distributed copies of 
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the random variable Y and Y their vector mean. Let UA*tj'^ be the eigen-decomposition of S* and, 
accordingly, define Q* = Y^{I - UkUl)Y in analogy to Q* = Y^{I - UkUl)Y. We then have the 
following result. 

Theorem 4: Assume n > p. Then, under the same conditions as Theorem [T] 

ExiUQ*)=ExiQ) + Op{l) (12) 

and 



Furthermore, under the conditions of Theorem [3j the power function 



(13) 



P0WERa.(7) := P I ^ ^L = > 1 (14) 



can be expressed as P > q*-^^"^^, where 



Simply put, the results of the theorem tell us that the accuracy with which compressed PCA subspace 
projection approximates standard PCA subspace projection in the original data space, when using the 
estimated covariance S* rather than the unknown covariance S*, is unchanged, as long as the sample size 
n used in computing S* is at least as large as the dimension p after random projection. Hence, there is an 
interesting trade off between n and p, in that the smaller the sample size n that is likely to be available, 
the smaller the dimension p that must be used in defining our random projection, if the ideal accuracy is 
to be obtained (i.e., that using the true S*). However, decreasing p will degrade the quality of the accuracy 
in this ideal case, as it increases the compression parameter c. 

IV. Simulation 

We present two sets of numerical simulation results in this section, one corresponding to Theorems [T] 
and [2l and the other, to Theorem [3j 
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In our first set of experiments, we simulated from the spiked covariance model, drawing both random 
variables X and their projections Y over many trials, and computed Q and Q* for each trial, thus allowing 
us to compare their respective means and variances. In more detail, we let the dimension of the original 
random variable X be / = 10, 000, and assumed that to be distributed as normal with mean zero and 
(without loss of generality) covariance equal to the spiked spectrum 

(Ti = 50, (72 = 40, (J3 = 30, (74 = 20, (75 = 10, (76 = . . . = 0"i = 1 , 

with m = 5. The corresponding random projections F of X were computed using random matrices 
$ generated as described in the text, with compression ratios c = l/p equal to 20,50, and 100 (i.e., 
p = 500, 200, and 100). We used a total of 2000 trials for each realization of $, and 30 realizations of $ 
for each choice of c (p). 

The results of this experiment are summarized in Table |lj Recall that Theorems [T] and |2] say that the 
rescaled mean E(Q*)/p and the ratio of variances Var(Q*) /Var{Q) should be approximately equal to c 
and c + 1, respectively. It is clear from these results that, for low levels of compression (i.e., c = 20) the 
approximations in our theorems are quite accurate and that they vary little from one projection to another. 
For moderate levels of compression (i.e., c = 50) they are similarly accurate, although more variable. 
For high levels of compression (i.e., c = 100), we begin to see some non-trivial bias entering, with some 
accompanying increase in variability as well. 



c 


P 


E{Q*)/P 


VariQ*)/VariQ) 


20 


500 


19.681(0.033) 


20.903(0.571) 


50 


200 


48.277(0.104) 


50.085(1.564) 


100 


100 


93.520(0.346) 


96.200(3.871) 



TABLE I 

Simulation results assessing the accuracy of Theorems[T|and[2] 

In our second set of experiments, we again simulated from a spiked covariance model, but now with 
non-trivial mean. The spiked spectrum was chosen to be the same as above, but with / = 5000, for 
computational considerations. The mean was defined as in (|6]), with 7 = 20, 30, or 40. A range of 
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compressions ratios c = 1, 2, . . . , 20 were used. We ran a total of 1000 trials for each realization of $, and 
30 realizations of $ for each combination of c and 7. The statistics Q and Q* were computed as in the 
statement of Theorem |3] and compared to the critical value 2:0.95 = 1.645, corresponding to a one-sided 
test of size a = 0.05. 

The results are shown in Figure |2j Error bars reflect variation over the different realizations of $ and 
correspond to one standard deviation. The curves shown correspond to the power approximation PoweRq* 
given in Theorem |3| and are the same as the middle three curves in Figure [T| We see that for the strongest 
anomaly level (7 = 40) the theoretical approximation matches the empirical results quite closely for all 
but the highest levels of compression. Similarly, for the weakest anomaly level (7 = 20), the match is 
also quite good, although there appears to be a small but persistent positive bias in the approximation 
across all compression levels. In both cases, the variation across choice of $ is quite low. The largest 
bias in the approximation is seen at the moderate anomaly level (7 = 30), at moderate to high levels of 
compression, although the bias appears to be on par with the anomaly levels at lower compression levels. 
The largest variation across realizations of $ is seen for the moderate anomaly level. 

V. Discussion 

Motivated by dual considerations of dimension reduction and data confidentiality, as well as the wide- 
ranging and successful implementation of PCA subspace projection, we have introduced a method of 
compressed PCA subspace projection and characterized key theoretical quantities relating to its use as 
a tool in anomaly detection. An implementation of this proposed methodology and its application to 
detecting IP-level volume anomalies in computer network traffic suggests a high relevance to practical 
problems [9|. Specifically, numerical results generated using archived Internet traffic data suggest that, 
under reasonable levels of compression c, it is possible to detect volume-based anomalies (i.e., in units 
of bytes, packets, or flows) using compressed PCA subspace detection at almost 70% the power of the 
uncompressed method. 

The results of Theorem |4] are important in establishing the practical feasibility of our proposed method. 
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Fig. 2. Simulation results assessing the accuracy of Theorem [3] 
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wherein the covariance S* must be estimated from data, when it is possible to obtain samples of size 
n of a similar order of magnitude as the reduced dimension p of our random projection. It would be 
of interest to estabhsh results of a related nature for the case where n <ti p. In that case, it cannot be 
expected that the classical moment-based estimator S* that we have used here will perform acceptably. 
Instead, an estimator exploiting the structure of S* presumably is needed. However, as most methods in 
the recent literature on estimation of large, structured covariance matrices assume sparseness of some sort 
(e-g-» S, [fTSll . [jlBl ). they are unlikely to be applicable here, since S* is roughly of the form cJpxp + W, 
where W is of rank m with entries of magnitude op{p^^). Similarly, neither will methods of sparse PC A 
be appropriate (e.g, [|35l . [fT6l . Il32l . Wl\ ). Rather, variations on more recently proposed methods aimed 
directly at capturing low -rank covariance structure hold promise (e.g., [|26l , [|25l ). Alternatively, the use of 
so-called very sparse random projections (e.g., [|24l\ in place of our Gaussian random projections, would 
yield sparse covariance matrices S*, and hence in principle facilitate the use of sparse inference methods 
in producing an estimate E*. But this step would likely come at the cost of making the already fairly 
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detailed technical arguments behind our results more involved still, as we have exploited the Gaussianity 
of the random projection in certain key places to simplify calculations. We note that ultimately, for such 
approaches to produce results of accuracy similar to that here in Theorem 4, it is necessary that they 
produce approximations to the PCA subspace of S* with order Op(n^^/^) accuracy. 

Finally, we acknowledge that the paradigm explored here, based on Gaussian random projections, is 
only a caricature of what might be implemented in reality, particularly in contexts like computer network 
traffic monitoring. There, issues of data management, speed, etc. would become important and can be 
expected to have non-trivial implications on the design of the type of random projections actually used. 
Nevertheless, we submit that the results presented in this paper strongly suggest the potential success of 
an appropriately modified system of this nature. 

VI. Appendix 

A. Proof of Theorem 1 

Suppose random vector X G has a multivariate Gaussian distribution A^(0, T^ixi) and Y ~ A^(0, S*xp)' 
for S* = ^$'S<I>. Denote the eigenvalues of S and S* as (ai, . . . ,ai) and (a];, . . . , a*), respectively. 
Jackson and Mudholkar [[T3l show that Q = {X - X)'{X - X) will be distributed as ^i^h where 

i=fc+l 

the Zi are independent and identically distributed (i.i.d.) standard normal random variables. Consequently, 
we have £'x(<5) = o-j and, similarly, i?x|<i>(<5*) = o"*. So comparison of and Ejci^lQ*) 

i=fc+l j=fc+l 

reduces to a comparison of partial sums of the eigenvalues of S and S*. 
Since 

Ilk k 

Ex{Q) = ^ (^i = ^(^t-^^t = tr{Q) - ^(^i , 

i=k+l i=l i=l 1=1 

k 

in the following proof we will analyze tr(Q) and ^ (Xj separately. 

i=l 

1) : Because orthogonal rotation has no influence on Gaussian random projection and the matrix 
spectrum, to simplify the computation, we assume without loss of generality that S = diag{ai, (72, ... , ai). 
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So the diagonal elements of E* = i$^E$ are 

1 ' 

i=l 

We have 

j=i 1=1 i=i j=i i=i 

and therefore 

i=i P j=i 

Under the spiked covariance model assumed in this paper, cti > (72 > . . . > cr^ > (7m+i — = 
= CT; = 1 for fixed m. Then, 

m ^ p I ^ p 

i=l ^ j=l i=m+l ^ j=l 

When /,p ^ oo, ^ — >■ c and the first term will go to zero like Op(p~^/^). The second term can be written 
as: 

E E(4-i) 

i=m+l ]=1 

More precisely, here we have a series {In}, {Pn\ satisfying >oo, p^— >oo,^— >c>0 when 
n ^ oo. It is easy to show that Dn = {In — fn)pn oo when n ^ oo. Since the (f)ij are i.i.d., we can 

/ Pn 

re-express the series E E(0fj - 1) as 

i=m+l j=l 

v=l 

Recalling that the are standard normal random variables, we know that E{(f)'^ — 1) = and Var{(f)'^ — 

N 

1) = 2. By the central limit theorem, the series {\/N ^ E (^i" ~ ^)}n=i will converge to a zero mean 

i"=l 

N 

normal in distribution. Hence ^ E i^l" ~ 1) order Op{N~^/'^) when — > oo. As an infinite 

i"=i 

subsequence. 
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also has the same behavior, which leads to 

^ - 1) = Op{D-'/') = Op{[{L - mK]-^/^) , 

i'=l 

by which we conclude that 

(/-m)— ^ E j2(^l-l) = il-m)0pi[{l-m)p]-'/') = 0p{l). 

^ i=m+l j=l 

As a result of the above arguments, 

tr(S*) - tr(S) = Opip-'^') + Op(l) = Op(l). 

2; ; Next we examine the behavior of the first k eigenvalues of S and S*, i.e., {cri . . . ak} and {cr* . . . a^}. 

Recalling the definition of F as F = -^^'X ~ A^(0, we define the / x p matrix Z = 2^/^$. 

All of the columns of Z are i.i.d random vectors from A^(0,S), and the covariance of Y, can 

be expressed as -Z'Z. Let S* = -ZZ' which contains the same non-zero eigenvalues as S* = -Z' Z. 
Through this transformation of F to Z and interpretation of S as the sample covariance corresponding to 
S, we are able to utilize established results from random matrix theory. 

Denote the spectrum of S as (si, . . . ,Sp). Under the spiked covariance model, Baik [1] and Paul [|30ll 
independently derived the limiting behavior of the elements of this spectrum. Under our normal case 
Zi ~ A^(0, S), Paul [l30ll proved the asymptotical normality of Sy. 

Theorem 5: Assume /,p — )■ 00 such that ^ = c + o{p^^/'^). If 0-^, > 1 + ^/c, then 

a^-l [cTy-iy 
For significantly large leading eigenvalues ^ 1, is asymptotically N^a^, "^crl). And for all of the 
lead eigenvalues which are above the threshold 1 + ^/c, we have a* — cr^, = Op{p^^/'^). Recalling the 
condition A; > m in the statement of the theorem, without loss of generality we take A; = m (as we will 
do, when convenient, throughout the rest of the proofs in these appendices). Using Paul's result, we have 

E-r = E^.+op(p-^/^) 

i=l i=l 



19 



Combining these results with those of the previous subsection, we have 

k k 

Exi^m - Ex{Q) = (tr(E*) - tr{T)) + (J] a* - J] a,) = . 

i=l i=l 



B. Proof of Theorem 2 

I 

For notational convenience, denote S* as (Ajjjpxp, so that = ^ {^%)- Writing Q = Yl '^i^h 



i=k+l 



and similarly for Q*, we have 

Varx(Q) = 2( al) and Var;,|^(Q*) = 2( ) = 2(||E*||| - J^af ) 



i=fc+l 



Since E* = E$, we have 



i=A;+l 



Accodingly, if i — j. 



and 



4^ - — 

^ii ~ 2 



1 ' 

Aj = - ^ (t>ih(t>jh(^h 

^ h=l 

1 ' 

h=l 



.h=l 



h^h' 



while if i 7^ j. 



A. 



h+h' 



.h=l 



Changing the order of summation, we therefore have 

I / p 



^* ||2 



1 

\F — ^ 



./i=l \i=l 



which implies, 



h=l \^ i=l / h^h' V i=l / 



(16) 



Now under the spiked covariance model, with k — m, we have 



Var(Q) = 2 ^ = 2 (Z - m) . 



\h=k+l 
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As a result, we have 



Varx|$(Q 



I V* l|2 

1^ \\f 



h=l 



1 



i=A:+l "^i 



Var(g) 

Substituting the expression in equation [T6] yields 

Varx|c(g*) _ 1 



/ — m 



I V* II 2 

1^ IIf 



Var(g) 



/ — m 



/i=l \^ i=l / h=l 

— m ^-^ \ p ^-^ 



(17) 



The control of equation 17 is not immediate. Let us denote the two terms in the RHS of 17 as A and B. 
Results in the next two subsections show that A behaves like 1 + Op{p^^/'^), and B, like c + Op{p^^^'^). 
Consequently, Theorem 2 holds. 



]) : We show in this subsection that 



A 



I — m 



h=l V i=l / h=l 



1 + Opip 



-1/2N 



First note that, by an appeal to the central limit theorem, - J2 ' 



1 + Op{p-^/^). So A can be 



expressed as 



/ — m 



h=l h=l / h=l h=k+l 

Using the result by Paul cited in Section A. 2, in the form of Theorem 6, the first term is found to behave 

like Op{p^^). In addition, it easy to see that the second term behaves like Op{p~^^'^). Finally, since under 

the spiked covariance model (Xm+i = • • • = o"/ = 1, taking k = m we have that 

I 

J2 + Op{p-h] ={l-m)[l + Op{p-'/' 

h=k+l 

As a result, the third term in the expansion of A is equal to 1 + Op{p~^/'^) 
Combining terms, we find that A = 1 + Op{p~^^'^). 



2) : Term B in [17] can be written as 

B 



p^il — m) ^ 

^ ^ ' l<h'<h<l 



(18) 



4 = 1 
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Recalling that under the spiked covariance model ai > a2 ■ ■ ■ > <Jm > cTm+i = . . . = cx/ = 1, in the 
following we will analyze the asymptotic behavior of the term B in two stages, by first handling the case 
ai = a2 = ■■■ = ai = 1 in detail, and second, arguing that the result does not change under the original 
conditions. 

If ai = (72 = . . . = 0"/ = 1, which is simply a white noise model, term B becomes. 



p X 2 



^ ^ ' l<h'<h<l \i=l 

which may be usefully re-expressed as 

2 



J2 E*^*^*^*^' ' ^^^^ 



p^il — m) ^ 

^ ^ ' l<h'<h<l \ i=l i>j 

and, upon exchanging the order of summation, as 

2 A ^ 4 



Yl $^0?h0'/.' + 2 , (20) 



^ ^ ' i=l l<h'<h<l ^ ^ ' i>j l<h'<h<l 



Write equation 21 as B = Bi + B2. In the material that immediately follows, we will argue that, under 
the conditions of the theorem and the white noise model, Bi = c + Op{p-^'^) and B2 = Op{p-^). 
To prove the first of these two expressions, we begin by writing 

T.^Yj^ and B^^^^i^^p) . 

Note that the Tj are i.i.d. random variables. We will use a central limit theorem argument to control Bi. 

A straightforward calculation shows that E{Ti) = 1(1 — l)/2. To characterize the second moment, we 
write 



'ih<pM<f>lH(f>m' 



\l<h'<h<l / h'<h;H'<H 

and consider each of three possible types of terms (pih'pM'P'iH'P'iH' • 

1) If H = h,H' = h', then 4>1h4>1h'4>iH4>iH' = with expectation 9. Since there are /(/ — l)/2 
such choices of (h, h'), the contribution of terms from this case to E{Tf) is 9[/(/ — l)]/2. 

2) If only two of {h, h', H, H') are equal, 4>'ih4>ih'4>iH4>iH' ^^^^ the form 4>ia4>ih4>tc with expectation 
3. For each triple (a, 6, c) there are six possible cases: h = H > h > H',h = H > H' > h,h > h' = 
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H > H',H > H' = h> h',H > h > H' = h',h > H > H' = h'. So there are /(/ - - 2) such 
terms in this case, yielding a contribution of 3/(/ — — 2) to E{Tf). 



3) If (/i, h\ H, H') are all different, the expectation of 'Pih'Pih'^'iH^'iH' i^^^ 1- Since there are 



4 



terms in total, the number of such terms in this case and hence the contribution of this case to E(Tf) 
is -/(/-!)(/ -2). 

Combining these various calculations we find that 



l\l - If , , /(/ - 1) 



E{Tt) = ' ^ ' + 8 + 2/(/ - 1)(/ - 2) 

and hence 

Var(Ti) = EiT^) - EiT^f = 2l\l - 1) . 

By the central limit theorem we know that ^ [f - E[T]) /^JVar{T) = Op(l). Exploiting that Bi = 
[2/p{l — m)]T and recalling that l/p = c + o(p~^/^) by assumption, simple calculations yield that Bi = 

As for B2, it can be shown that £'(-82) = and 

VarTO= ^lfj~% =o{p-') , 

from which it follows, by Chebyshev's inequality, that B2 = Op{p~^). 

Combining all of the results above, under the white noise model, i.e., when ai = a2 = . . . = ai = 1, 
we have B = c + Op{p^^^'^) . In the case that the spiked covariance model instead holds, i.e., when 



0"! > (72 • • • > > cTm+i = . . . = 0"^ = 1, it Can bc shown that the impact on equation 18 is to introduce 
an additional term of op(p~^). The effect is therefore negligible on the final result stated in the theorem, 
which involves an Op{p~^^'^) term. Intuitively, the value of the first m eigenvalues will not influence 



the asymptotic behavior of the infinite sum in 18 which is term B. 



C. Proof of Theorem 3 

Through a coordinate transformation, from X to V'^X, we can, without loss of generality, restrict our 
attention to the case where X ~ N{ix, S), for E = diag(cri, . . . , ai), and our testing problem is of the 
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form 

Ho: fi = vs //i : /i = (0, . . . , 0, 7, 0, . . . , 0)^ 

d>k 

for some 7 > 0. In other words, we test whether the underlying mean is zero or differs from zero in a 
single component by some value 7 > 0. 
Consider first the expression for PoweRq in ([8]). Under the above model, 

j=k+i 

where the Zj are independent N{fij,l) random variables. So, under the alternative hypothesis, the sum of 
their squares is a non-central chi-square random variable, on / — A; degrees of freedom, with non-centrality 
parameter . . . , /x;)^||2 = 7^. We have by standard formulas that 

E[Q] = {l-k)+^' 

and 

Var(g) = 2(/ -k)+ 47^ . 
Using these expressions and the normal-based expression for power defining ([8]), we find that 

P0WERq(7) =fIz> 



(l-k) 7 



2 



{l-k) + 272 ^2(/-A;)+472 
as claimed. 

Now consider the expression for PoweRq* in (9), where we write Q* = Y'^MY for Y = and 
M = (I — UkUk^). Under the null hypothesis, we have 

1 



and 



ExM*) =tr M-$^S$ 



YarxM*) = 2tr ( M-$^E$M-$^E$ 
\ P P 

Call these expressions e and u, respectively. Under the alternative hypothesis, the same quantities take the 
form 
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and 



Varx|$(g*) = z/ + 472i , 



respectively, where 



and 



7^^ = -u'^$[M(-$^E$)Ml$V (22) 
P P 



^^B = -/i^*^*'^/^ • (23) 
P 



Arguing as above, we find that 



P0WERq.(7) = F\ Z > zi. 



V + 47M V + 472^ 

By Theorem [2| we know that z/ = 2(/ — k){c + 1 + Op(p^^/^)). Ignoring the higher-order stochastic 
error term we therefore have 



(c +!)(/- A;) 725 



P0WERo.(7)=P Z>Zi_a;, 

' " +!)(/- A:) + 27M ^2(c+l)(/-fc)+47M^ 

This expression is the same as that in the statement of Theorem |3} up to a re-scaling by a factor of c+ 1. 
Hence it remains for us to show that 

i = c+l + Op(j9"^/2) and £ = 1 + Op(p"^/^) . 

Our problem is simplified under transformation by the rotation $ — )• $0, where Opxp is an arbitrary 
orthonormal rotation matrix. If we similarly apply 

S* ^ 0^S*0, f/ ^ O^V, andM ^ O^MO, 



then A and S remain unchanged in (22) and (23). Recall that E* = UA*U'^, where A* = diag{si, . . . , Sp), 
and |U = (0, . . . , 0, 7, 0, . . . , 0)-^, with 7 in the + 1 > A; location. Choosing O = U and denoting 
$t/ = (%), straightforward calculations yield that 
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and 

p 



1 



Now write $^ = [$|^, ^l/fc]"^, where $fc denotes the first k rows of $, and ^fc, the last / — k rows. 
The elements r]dj in the two sums immediately above lie in the d-th row of the product of \E'fc and the 
last p — k columns of U. By Paul [|30l Thm 4], we know that if am, the last leading eigenvalue in the 
spiked covariance model, is much greater than 1, and l,p oo such that ^ = c + o(p^^/^), then the 
distance between the subspaces span{$fc} and span{f/fc} diminishes to zero. Asymptotically, therefore, 
we may assume that these two subspace coincide. Hence, since is statistically independent of it 
follows that \E'fc is asymptotically independent of Uk, and therefore of the orthogonal complement of Uk, 
i.e., the last {p — k) columns of U. As a result, the elements in {r]d,k+i-, • • • , Vd,pY behave asymptotically 
like independent and identically distributed standard normal random variables. Applying Chebyshev's 
inequality in this context, it can be shown that 

i = c + 1 + Op{p-^''^) and B = l + Opij)-^''^). 

Rescaling by (c + 1), the expressions for A and B in Theorem 4 are obtained. 

D. Proof of Theorem 4 

Let M = / - UkUl and M = / - f/^f/f . If we use the sample covariance f* = ^(Y - F)(Y - to 
estimate S*, we will observe the residual Q* = Y^MY instead of Q* = Y^MY. To prove the theorem it 
is sufficient to derive expressions for Ex\<s>{Q*) and Varx|<i.(Q*) under the null and alternative hypothesis 
in (|6]), as these expressions are what inform the components of the critical value in the power calculation. 
Our method of proof involves re-expressing Ex\<i>{Q*) and Varx|$((0*) in terms of M and M — M and 
showing that those terms involving the latter are no larger than the error terms associated with the former 
in Theorems 1, 2, and 3. 

Begin by considering the mean and writing 
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We need to control the term 



Exi'S'iQ* — Q* 



Exi^[Y^{M-M)Y] 



tr 



(M - M)S* + -/i^$(M - M)$V 
P 



(24) 



Under the null hypothesis the second term in (24) is zero, and so to prove (12) we need to show that the 
first term is Op(l). 

Note that, without loss of generality, we may write S* = S* + T^l, where = (l/p)<lf Ai<i>i and 
Eg = (l/p)<i'2 for <i>^ = [<i>^, ^ random matrix of independent and identically distributed standard 
Gaussian random variables and Ai = diag(cri, . . . , cr„J. Then using [|29l Thm II. 1], with D = — S2 in the 
notation of that paper, it follows that 



tr 



(M - M)S* < max (^|Ai(M - M)|, |Ap(M - M)|) [tr(S*) - tr(S;)] + tr (M - M)E 



(25) 



where we use Aj(-) generically here and below to denote the i-th largest eigenvalue of its argument. 
For the second term in the right-hand side of (25), write M — M = UkU"^ — UkUk ■ Using a result 



attributed to Mori (appearing as Lemma I.l in [|29l ), we can write 

\p{T.l)triUkUl) < triUkUlT.1) < X,{J:i)tr{UkU^) , 

and similarly for UkU'^ in place of UkU^ ■ Exploiting the linearity of the trace operation and the fact that 

rank(?7jfcf/J) = rank([/fcf/J) = k, we can bound the term of interest as 

tr[(M-M)S;]| <fc[Ai(S;)-Ap(S;)] . 

However, Ai and Ap are equal to c + o(p~^/^) times the largest and smallest eigenvalues of a sample 
covariance of standard Gaussian random variables, the latter which converge almost surely to the right 
and left endpoints of the Marchenko-Pastur distribution (e.g., HI), which in this setting take the values 

[1 + {l/cY'^f and [1 - (l/c)i/2]2^ respectively. Hence, tr[{M - M)S*] = Op(l). 



Now consider the factor tr(S*) — ^^(Sg) in the first term of the right-hand side of (25). We have 



shown that tr(S*) = tr(E) + Op(l). At the same time, we note that tr{T.l) = 1{1 + Op{{plY^/'^), being 
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proportional to the normalized trace of a matrix whose entries are independent and identically distributed 
copies of averages of / — m independent and identically distributed chi-square random variables on one 
degree of freedom. Therefore, and recalling the spiked covariance model, we find that tr(S*) —tr{T.2) = 

At the same time, the factor multiplying this term, i.e., the largest absolute eigenvalue of M — M, is 
just the operator norm ||M — M||2 and hence bounded above by the Frobenius norm, ||M — MHi?. We 
introduce the notation Pj for the j-th column of U times its transpose, and similarly, Pj, in the case of 
U. Then M-M = T!]=i{Pj - Pj) and 

k 

\\M-M\\f<Y,\\Pj-Pj\\^ ■ 
i=i 

To bound this, we use a result in Watson |!34', App B, (3.8)], relying on a multivariate central limit theorem. 



trjP.GPkG) 

j - Sky 



in distribution, as n — )■ oo, where G is a random matrix whose distribution depends only on E* and recall 

(si, . . . ,Sp) are the eigenvalues of S*. So ||M — M\\2 = Op{n^^/'^). 



Therefore, the left-hand side of (25) is Op(l) and (12) is established. Now consider the second term 



in (24), which must be controled under the alternative hypothesis. This is easily done, as we may write 



P 



< -||$Vll2 \\M-M\\2 
P 



and note that the first term is Op(l) while the second is Op{n ^/^). Therefore, under the assumption that 



n 



> p, the entire term is Op{p ^/^), which is the same order of error to which we approximate 5 in (23) 



in the proof of Theorem 3. Hence, the contribution of the mean to the critical value in (15), using S*, is 



the same as in (11), using E* 



This completes our treatment of the mean. The variance can be treated similarly, writing 



y^xM*) = y^xM*) + y^xM* - q*) + 2CowxM*, q* - q* 
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and controling the last two terms. The first of these two terms takes the form 

.12 4 



P 



Varx|$(Q* -Q*) = 2tr (M - M)S 
and the second, 

Co\ x\<!>{Q*, Q* -Q*) = 2tr MTr{M-M)^ 



{M - M)T.*{M - M) 



(26) 



P 



mj:*{m-m) 



(27) 



Again, under the null hypothesis, the second terms in (26) and (27) are zero. Hence, to establish (13), 



it is sufficient to show that the first terms in (26) and (27) are Op{p^^'^). We begin by noting that 



tr 



(M - M)S* 



< tr 



(M-M)^(S 



< tr 



{M-Mf tr[(E*)2] 



where the first inequality follows from [5, Thm 1], and the second, from Cauchy-Schwartz. Straightforward 
manipulations, along with use of 1291 Lemma I.l], yields that tr{M-Mf < 2k\\M - M\\2 = Opin'^/'^). 
At the same time, we have that 

tr(S*)2 < Ai(E*) tr(S*) = [Ai(S) + Op{p-^'^)] [tr(S) + Op(l)] = Op{l) . 

Therefore, under the assumptions that n > p and l/p = c + o{p^^^'^), we are able to control the relevant 



error term in (26) as Op{n~'^^^)Op{l) = Op{p^/^). 



Similarly, using [|29l Lemma LI] again, we have the bound 



tr 



MJ:*{M-M)^* < ||(M- M)S*||2 tr(MS*) 



The first term in this bound is Op{n ^/'^), while the second is Op{l), which allows us to control the 
relevant error term in (27) as Op{p^^'^). As a result, under the null hypothesis, we have that Yarx\q>{Q*) = 
Varx|$(Q*) + Op{p^^^), which is sufficient to establish (fTsl), since YavxiQ) = 0{l) = 0{p). 



Finally, we consider the second terms in (26) and (27), which must be controled as well under the 
alternative hypothesis. Writing 



{M - M)Tr{M - M) $'^'/x< ||<l'Vll2ll^-^ll2p*ll2 



and 



MS*(M-M) $> < ||<I>>||^||M-M||2||S*||2||M||2 
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it can be seen that we can bound the first of these expressions by Op(l), and the second, by Op(p^/^). 



Therefore, the combined contribution of the second terms in (26) and (27) is Op{p ^/^), which is the 



same order to which we approximate A in (22) in the proof of Theorem 3. Hence, the contribution of the 



variance to the critical value in (15), using S*, is the same as in (11 ), using S*. 
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